160 research outputs found
Handling Concept Drift for Predictions in Business Process Mining
Predictive services nowadays play an important role across all business sectors. However, deployed machine learning models are challenged by data streams that change over time, a phenomenon described as concept drift. This phenomenon can strongly degrade the prediction quality of a model, which is why concept drift is usually handled by retraining the model. However, current research lacks a recommendation as to which data should be selected for retraining the machine learning model. Therefore, we systematically analyze different data selection strategies in this work. Subsequently, we instantiate our findings on a use case in process mining that is strongly affected by concept drift. We show that concept drift handling improves accuracy from 0.5400 to 0.7010. Furthermore, we depict the effects of the different data selection strategies.
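The abstract does not spell out which selection strategies were compared; a minimal sketch of three commonly discussed options (the strategy names and the `window` parameter are illustrative assumptions, not the paper's method):

```python
import numpy as np

def select_training_data(X, y, strategy, drift_idx=None, window=500):
    """Select which historical instances to use when retraining after a drift.

    Hypothetical strategies (not necessarily those of the paper):
      - "full_history":   keep all data seen so far
      - "sliding_window": keep only the most recent `window` instances
      - "post_drift":     keep only data observed after the detected drift
    """
    if strategy == "full_history":
        return X, y
    if strategy == "sliding_window":
        return X[-window:], y[-window:]
    if strategy == "post_drift":
        if drift_idx is None:
            raise ValueError("post_drift requires the index of the detected drift")
        return X[drift_idx:], y[drift_idx:]
    raise ValueError(f"unknown strategy: {strategy}")

# Example: 1,000 instances in arrival order, drift detected at index 800
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)
Xs, ys = select_training_data(X, y, "post_drift", drift_idx=800)
print(len(Xs))  # 200
```

The trade-off the strategies expose: more history means more training signal but also more pre-drift (now misleading) instances.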
How to Cope with Change? - Preserving Validity of Predictive Services over Time
Companies rely more and more on predictive services that constantly monitor and analyze the available data streams for better service offerings. However, sudden or incremental changes in those streams challenge the validity and proper functionality of a predictive service over time. We develop a framework that allows characterizing and differentiating predictive services with regard to their ongoing validity. Furthermore, this work proposes a research agenda of worthwhile topics to improve the long-term validity of predictive services. In our work, we especially focus on different scenarios of true-label availability for predictive services as well as the integration of expert knowledge. With these insights at hand, we lay an important foundation for future research in the field of valid predictive services.
CHALLENGES IN THE DEPLOYMENT AND OPERATION OF MACHINE LEARNING IN PRACTICE
Machine learning has recently emerged as a powerful technique to increase operational efficiency or to develop new value propositions. However, the translation of a prediction algorithm into an operationally usable machine learning model is a time-consuming and challenging task in various ways. In this work, we aim to systematically elicit the challenges in deployment and operation to enable broader practical dissemination of machine learning applications. To this end, we first identify relevant challenges with a structured literature analysis. Subsequently, we conduct an interview study with machine learning practitioners across various industries, perform a qualitative content analysis, and identify challenges organized along three distinct categories as well as six overarching clusters. Eventually, results from both literature and interviews are evaluated with a comparative analysis. Key issues identified include automated strategies for data drift detection and handling, standardization of machine learning infrastructure, and appropriate communication and expectation management.
Utilizing Concept Drift for Measuring the Effectiveness of Policy Interventions: The Case of the COVID-19 Pandemic
As a reaction to the high infectiousness and lethality of the COVID-19 virus, countries around the world have adopted drastic policy measures to contain the pandemic. However, it remains unclear what effect these measures, so-called non-pharmaceutical interventions (NPIs), have on the spread of the virus. In this article, we use machine learning and apply drift detection methods in a novel way to predict the time lag of policy interventions with respect to the development of daily case numbers of COVID-19 across 9 European countries and 28 US states. Our analysis shows that, on average, more than two weeks pass between NPI enactment and a drift in the case numbers.
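The core idea, locating a drift in the case-number series and measuring its distance to the NPI date, can be sketched as follows. The detector below is a simplified mean-shift test on a noiseless synthetic series, a stand-in for the drift detection methods the article actually applies; all names and parameters are illustrative:

```python
import numpy as np

def detect_drift_day(series, window=7, threshold=3.0):
    """Return the first day whose value deviates strongly from the mean of
    the trailing window (a simplified stand-in for the paper's detectors)."""
    for t in range(window, len(series)):
        ref = series[t - window:t]
        mu, sigma = ref.mean(), ref.std() + 1e-9
        if abs(series[t] - mu) > threshold * sigma:
            return t
    return None

# Noiseless synthetic daily case numbers: an NPI is enacted on day 10,
# and its effect shows up as a level drop starting on day 25.
cases = np.array([50.0] * 25 + [20.0] * 25)
npi_day = 10
drift_day = detect_drift_day(cases)
print(drift_day - npi_day)  # 15 -> estimated time lag in days
```

Real case-number streams are noisy and trending, which is why the article relies on dedicated drift detectors rather than a fixed-threshold test like this one.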
Detecting Concept Drift With Neural Network Model Uncertainty
Deployed machine learning models are confronted with the problem of data that changes over time, a phenomenon also called concept drift. While existing approaches to concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. Especially in many real-world application scenarios, like the ones covered in this work, true labels are scarce and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drifts without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique to the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input-data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change in the input data only (which can lead to unnecessary retrainings). We show that UDD outperforms other state-of-the-art strategies on two synthetic as well as ten real-world data sets for both regression and classification tasks.
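The UDD pipeline, stochastic forward passes to estimate uncertainty, then change detection on the uncertainty stream, can be sketched without a deep learning framework. The two `forward_*` functions below are placeholders for a dropout-enabled network, and the mean-shift check is a simplified stand-in for ADWIN; everything here is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_dropout_uncertainty(forward, x, n_passes=30):
    """Predictive uncertainty as the standard deviation over stochastic
    forward passes (Monte Carlo Dropout); `forward` must keep dropout active."""
    preds = np.array([forward(x) for _ in range(n_passes)])
    return preds.std()

def forward_stable(x):
    # Placeholder model: low dropout-induced variance (in-distribution input)
    return x + rng.normal(0, 0.05)

def forward_drifted(x):
    # Placeholder model: high variance (input far from the training data)
    return x + rng.normal(0, 1.0)

# Stream of uncertainty estimates: 50 stable steps, then 50 drifted steps
uncertainties = [mc_dropout_uncertainty(forward_stable, 1.0) for _ in range(50)]
uncertainties += [mc_dropout_uncertainty(forward_drifted, 1.0) for _ in range(50)]

# Simplified change check on the uncertainty stream (ADWIN stand-in)
baseline = np.mean(uncertainties[:50])
drift_at = next((i for i, u in enumerate(uncertainties) if u > 5 * baseline), None)
if drift_at is not None:
    print(f"drift detected at step {drift_at} -> trigger retraining")
```

Note that no true labels enter the detection step: only the model's own uncertainty stream is monitored, which is the point of the approach.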
Human vs. supervised machine learning: Who learns patterns faster?
The capabilities of supervised machine learning (SML), especially compared to human abilities, are being discussed in scientific research and in the usage of SML. This study provides an answer to how learning performance differs between humans and machines when there is limited training data. We have designed an experiment in which 44 humans and three different machine learning algorithms identify patterns in labeled training data and have to label instances according to the patterns they find. The results show a high dependency between performance and the underlying patterns of the task. Whereas humans perform relatively similarly across all patterns, machines show large performance differences for the various patterns in our experiment. After seeing 20 instances in the experiment, human performance does not improve anymore, which we relate to theories of cognitive overload. Machines learn more slowly but can reach the same level, or may even outperform humans, in 2 of the 4 patterns used. However, machines need more instances than humans for the same results. The performance of machines is comparably lower for the other 2 patterns due to the difficulty of combining input features.
Handling Concept Drifts in Regression Problems -- the Error Intersection Approach
Machine learning models are omnipresent for predictions on big data. One challenge of deployed models is the change of the data over time, a phenomenon called concept drift. If not handled correctly, a concept drift can lead to significant mispredictions. We explore a novel approach for concept drift handling, which depicts a strategy to switch between the application of simple and complex machine learning models for regression tasks. We assume that the approach plays to the individual strengths of each model, switching to the simpler model if a drift occurs and switching back to the complex model for typical situations. We instantiate the approach on a real-world data set of taxi demand in New York City, which is prone to multiple drifts, e.g., weather phenomena such as blizzards, which result in a sudden decrease in taxi demand. We are able to show that our suggested approach significantly outperforms all regarded baselines.
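The switching idea can be sketched as follows: track the recent errors both models would have made and serve whichever is currently better. The class name, the `window` parameter, and the switching criterion are illustrative assumptions; the paper's exact error-intersection criterion may differ:

```python
from collections import deque
import numpy as np

class ModelSwitcher:
    """Serve predictions from the model with the lower recent error,
    a simplified sketch of the error-intersection idea."""

    def __init__(self, simple, complex_, window=24):
        self.models = {"simple": simple, "complex": complex_}
        self.errors = {k: deque(maxlen=window) for k in self.models}
        self.active = "complex"  # default to the complex model

    def predict(self, x):
        return self.models[self.active](x)

    def update(self, x, y_true):
        # Record the error both models *would* have made, then activate
        # whichever model has the lower mean recent error.
        for name, model in self.models.items():
            self.errors[name].append(abs(model(x) - y_true))
        self.active = min(self.errors, key=lambda k: np.mean(self.errors[k]))

# Toy demand stream: the complex model fits the typical level, the simple
# model is a robust constant; demand collapses after step 2 (blizzard-like drift).
simple = lambda x: 10.0
complex_ = lambda x: 100.0
switcher = ModelSwitcher(simple, complex_, window=3)

for y in [100, 100, 12, 11, 10]:
    switcher.update(x=None, y_true=y)
print(switcher.active)  # "simple"
```

After the drop, the simple model's recent errors fall below the complex model's, so the switcher hands over to it, mirroring the switch-on-drift behavior described above.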
How to Conduct Rigorous Supervised Machine Learning in Information Systems Research: The Supervised Machine Learning Reportcard [in press]
Within the last decade, the application of supervised machine learning (SML) has become increasingly popular in the field of information systems (IS) research. Although the choices among different data preprocessing techniques, as well as different algorithms and their individual implementations, are fundamental building blocks of SML results, their documentation—and therefore reproducibility—is inconsistent across published IS research papers.
This may be quite understandable, since the goals and motivations for SML applications vary and since the field has been rapidly evolving within IS. For the IS research community, however, this poses a big challenge, because even with full access to the data neither a complete evaluation of the SML approaches nor a replication of the research results is possible.
Therefore, this article aims to provide the IS community with guidelines for comprehensively and rigorously conducting, as well as documenting, SML research: First, we review the literature concerning steps and SML process frameworks to extract relevant problem characteristics and relevant choices to be made in the application of SML. Second, we integrate these into a comprehensive “Supervised Machine Learning Reportcard (SMLR)” as an artifact to be used in future SML endeavors. Third, we apply this reportcard to a set of 121 relevant articles published in renowned IS outlets between 2010 and 2018 and demonstrate how and where the documentation of current IS research articles can be improved. Thus, this work should contribute to a more complete and rigorous application and documentation of SML approaches, thereby enabling a deeper evaluation and reproducibility/replication of results in IS research.